Data Visualization#

Load data#

import pandas as pd
import sys
sys.path.append('../')
from source.plots import *
output_notebook()

file_path = '../data/'
model_name = 'AML Epigenomic Risk'

# Read the data
df = pd.read_excel(file_path + 'alma_main_results.xlsx', index_col=0).sort_index()
sig_results = pd.read_excel(file_path + 'signature_results.xlsx', index_col=0).sort_index()

df = df.join(sig_results)

# Define train and test samples
df_train = df[df['Train-Test']=='Train Sample']
df_test = df[df['Train-Test'] == 'Test Sample']

# remove duplicates from the test cohort (.copy() so later column assignments act on an independent frame)
df_test = df_test[~df_test['Patient_ID'].duplicated(keep='last')].copy()

# Prognostic model samples
df_px = df[~df['Vital Status at 5y'].isna()]
df_px2 = df_px[df_px['Clinical Trial'].isin(['AAML0531', 'AAML1031', 'AAML03P1'])]
df_px2 = df_px2[df_px2['Sample Type'].isin(
    ['Diagnosis', 'Primary Blood Derived Cancer - Bone Marrow', 'Primary Blood Derived Cancer - Peripheral Blood'])]
df_px2 = df_px2[~df_px2['Patient_ID'].duplicated(keep='last')]

# drop the samples with a missing WHO 2022 Diagnosis label
df_dx = df_train[~df_train['WHO 2022 Diagnosis'].isna()]

# exclude the classes with fewer than 5 samples
df_dx = df_dx[~df_dx['WHO 2022 Diagnosis'].isin(['AML with t(9;22); BCR::ABL1'])]

df_px_ = df_px.sort_values(by='P(Death) at 5y').reset_index().reset_index(names=['Percentile']).set_index('index')
df_px_['Percentile'] = df_px_['Percentile'] / len(df_px_['Percentile'])
df2 = df.join(df_px_[['Percentile']])
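
The last three lines of the cell above sort samples by `P(Death) at 5y` and store each sample's position divided by the cohort size. A minimal sketch of that rank-fraction idea on toy data follows; the helper name `add_rank_fraction` and the sample IDs are hypothetical and not part of the project code.

```python
import pandas as pd

def add_rank_fraction(frame, score_col, out_col="Percentile"):
    """Return a copy sorted by score_col with a rank fraction in [0, 1)."""
    ranked = frame.sort_values(by=score_col).copy()
    # Position 0 is the lowest score; dividing by n keeps values below 1.
    ranked[out_col] = [position / len(ranked) for position in range(len(ranked))]
    return ranked

# Toy data with made-up sample IDs and probabilities.
toy = pd.DataFrame({"P(Death) at 5y": [0.9, 0.1, 0.5]}, index=["s1", "s2", "s3"])
toy_ranked = add_rank_fraction(toy, "P(Death) at 5y")
```

The lowest-risk sample gets 0 and the highest-risk sample gets (n - 1) / n, so the column never reaches 1.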

Interactive atlas#

from source.alma_plot import *

plot_alma(df2, save_html=True)

Patient Characteristics#

ALMA (unsupervised)#

from tableone import TableOne
from datetime import date

columns = ['Hematopoietic Entity','Age (group years)','Sex',
            'Clinical Trial',]

mytable_cog = TableOne(df_train.reset_index(), columns,
                        overall=False, missing=False,
                        pval=False, pval_adjust=False,
                        htest_name=True,dip_test=True,
                        tukey_test=True, normal_test=True,

                        # 'FLT3 ITD' appeared twice in this dict; only the later ['Yes'] entry takes effect
                        order={'Age (group years)':['0-5','5-13','13-39','39-60'],
                                'MRD 1 Status': ['Positive'],
                                'Risk Group': ['High Risk', 'Standard Risk'],
                                'FLT3 ITD': ['Yes'],
                                'Leucocyte counts (10⁹/L)': ['≥30'],
                                'Age group (years)': ['≥10']})

mytable_cog.to_excel('../data/pt_characteristics_alma_model_' + str(date.today()) +'.xlsx')

mytable_cog.tabulate(tablefmt="html", 
                        # headers=[score_name,"",'Missing','Discovery','Validation','p-value','Statistical Test']
                        )
|  |  | Overall |
|:---|:---|:---|
| n |  | 3314 |
| Hematopoietic Entity, n (%) | Acute lymphoblastic leukemia (ALL) | 700 (28.3) |
|  | Acute myeloid leukemia (AML) | 1221 (49.4) |
|  | Acute promyelocytic leukemia (APL) | 31 (1.3) |
|  | Mixed phenotype acute leukemia (MPAL) | 48 (1.9) |
|  | Myelodysplastic syndrome (MDS or MDS-like) | 223 (9.0) |
|  | Otherwise-Normal (Control) | 251 (10.1) |
| Age (group years), n (%) | 0-5 | 480 (24.1) |
|  | 5-13 | 483 (24.2) |
|  | 13-39 | 663 (33.2) |
|  | 39-60 | 165 (8.3) |
|  | 60+ | 203 (10.2) |
| Sex, n (%) | Female | 885 (49.1) |
|  | Male | 918 (50.9) |
| Clinical Trial, n (%) | AAML03P1 | 72 (2.2) |
|  | AAML0531 | 628 (18.9) |
|  | AAML1031 | 587 (17.7) |
|  | BM normal AAML0531 | 41 (1.2) |
|  | Beat AML Consortium | 316 (9.5) |
|  | CCG2961 | 41 (1.2) |
|  | CETLAM SMD-09 (MDS-tAML) | 166 (5.0) |
|  | French GRAALL 2003–2005 | 141 (4.3) |
|  | Japanese AML05 | 64 (1.9) |
|  | NOPHO ALL92-2000 | 933 (28.2) |
|  | TARGET ALL | 131 (4.0) |
|  | TCGA AML | 194 (5.9) |

Fine-tuned (supervised) Dx and Px models#

columns = ['Age (years)','Age group (years)','Sex','Race or ethnic group',
            'Hispanic or Latino ethnic group', 'MRD 1 Status',
            'Leucocyte counts (10⁹/L)', 'BM leukemic blasts (%)',
            'Risk Group','FLT3 ITD', 'Clinical Trial']

df_test['Age (years)'] = df_test['Age (years)'].astype(float)

# join discovery clinical data with validation clinical data
all_cohorts = pd.concat([df_dx, df_px2, df_test],
                         axis=0, keys=['Dx Discovery','Px Discovery' ,'Validation'],
                         names=['cohort']).reset_index()

# columns = ['Age group (years)','Sex', 'MRD 1 Status',
#             'Leucocyte counts (10⁹/L)',
#             'Risk Group','FLT3 ITD', 'Treatment Arm','Clinical Trial']

mytable_cog = TableOne(all_cohorts, columns,
                        overall=False, missing=False,
                        pval=False, pval_adjust=False,
                        htest_name=True,dip_test=True,
                        tukey_test=True, normal_test=True,

                        # 'FLT3 ITD' appeared twice in this dict; only the later ['Yes'] entry takes effect
                        order={'Race or ethnic group':['White','Black or African American','Asian'],
                                'MRD 1 Status': ['Positive'],
                                'Risk Group': ['High Risk', 'Standard Risk'],
                                'FLT3 ITD': ['Yes'],
                                'Leucocyte counts (10⁹/L)': ['≥30'],
                                'Age group (years)': ['≥10']},
                                groupby='cohort')

mytable_cog.to_excel('../data/pt_characteristics_fine-tuned_models_' + str(date.today()) +'.xlsx')

mytable_cog.tabulate(tablefmt="html", 
                        # headers=[score_name,"",score_name,'Validation','p-value','Statistical Test']
)
|  |  | Dx Discovery | Px Discovery | Validation |
|:---|:---|:---|:---|:---|
| n |  | 2471 | 946 | 200 |
| Age (years), mean (SD) |  | 19.2 (19.7) | 9.4 (6.3) | 8.8 (6.0) |
| Age group (years), n (%) | ≥10 | 528 (47.4) | 463 (48.9) | 95 (48.0) |
|  | <10 | 586 (52.6) | 483 (51.1) | 103 (52.0) |
| Sex, n (%) | Female | 711 (50.5) | 468 (49.5) | 86 (43.0) |
|  | Male | 697 (49.5) | 478 (50.5) | 114 (57.0) |
| Race or ethnic group, n (%) | White | 1064 (80.5) | 697 (79.1) | 142 (71.7) |
|  | Black or African American | 131 (9.9) | 102 (11.6) | 32 (16.2) |
|  | Asian | 65 (4.9) | 43 (4.9) | 1 (0.5) |
|  | American Indian or Alaska Native | 7 (0.5) | 5 (0.6) |  |
|  | Other | 48 (3.6) | 28 (3.2) | 21 (10.6) |
|  | Pacific Islander | 7 (0.5) | 6 (0.7) | 2 (1.0) |
| Hispanic or Latino ethnic group, n (%) | Hispanic or Latino | 209 (19.6) | 185 (20.2) | 25 (12.6) |
|  | Not Hispanic or Latino | 858 (80.4) | 731 (79.8) | 173 (87.4) |
| MRD 1 Status, n (%) | Positive | 284 (29.6) | 260 (31.5) | 76 (40.4) |
|  | Negative | 675 (70.4) | 566 (68.5) | 112 (59.6) |
| Leucocyte counts (10⁹/L), n (%) | ≥30 | 579 (52.4) | 467 (49.4) | 87 (43.7) |
|  | <30 | 526 (47.6) | 479 (50.6) | 112 (56.3) |
| BM leukemic blasts (%), mean (SD) |  | 65.7 (24.1) | 63.8 (24.5) | 60.2 (25.6) |
| Risk Group, n (%) | High Risk | 198 (14.2) | 129 (13.8) | 51 (25.5) |
|  | Standard Risk | 628 (45.0) | 454 (48.7) | 86 (43.0) |
|  | Low Risk | 570 (40.8) | 349 (37.4) | 63 (31.5) |
| FLT3 ITD, n (%) | Yes | 180 (16.2) | 165 (17.5) | 31 (15.7) |
|  | No | 932 (83.8) | 779 (82.5) | 167 (84.3) |
| Clinical Trial, n (%) | AAML03P1 | 62 (2.5) | 36 (3.8) |  |
|  | AAML0531 | 517 (20.9) | 507 (53.6) |  |
|  | AAML1031 | 495 (20.0) | 403 (42.6) |  |
|  | BM normal AAML0531 | 41 (1.7) |  |  |
|  | Beat AML Consortium | 192 (7.8) |  |  |
|  | CCG2961 | 31 (1.3) |  |  |
|  | CETLAM SMD-09 (MDS-tAML) | 166 (6.7) |  |  |
|  | French GRAALL 2003–2005 | 141 (5.7) |  |  |
|  | Japanese AML05 | 9 (0.4) |  |  |
|  | NOPHO ALL92-2000 | 641 (25.9) |  |  |
|  | TARGET ALL | 56 (2.3) |  |  |
|  | TCGA AML | 120 (4.9) |  |  |
|  | AML02 |  |  | 158 (79.0) |
|  | AML08 |  |  | 42 (21.0) |

By prognostic group#

Discovery#

AML Epigenomic Risk

def pt_characteristics_by_model(df, model_name, traintest = 'discovery'):
        columns = ['Age (years)','Age group (years)','Sex','Race or ethnic group',
                'Hispanic or Latino ethnic group', 'MRD 1 Status',
                'Leucocyte counts (10⁹/L)', 'BM leukemic blasts (%)',
                'Risk Group', 'Clinical Trial','FLT3 ITD', 'Treatment Arm']

        mytable_cog = TableOne(df, columns,
                                overall=False, missing=False,
                                pval=True, pval_adjust=False,
                                htest_name=True,dip_test=True,
                                tukey_test=True, normal_test=True,

                                # 'FLT3 ITD' appeared twice in this dict; only the later ['Yes'] entry takes effect
                                order={'Race or ethnic group':['White','Black or African American','Asian'],
                                        'MRD 1 Status': ['Positive'],
                                        'Risk Group': ['High Risk', 'Standard Risk'],
                                        'FLT3 ITD': ['Yes'],
                                        'Leucocyte counts (10⁹/L)': ['≥30'],
                                        'Age group (years)': ['≥10']},
                                groupby=model_name)

        mytable_cog.to_excel('../data/pt_characteristics_'+ model_name +'_' + traintest + '_' + str(date.today()) + '.xlsx')

        return(mytable_cog.tabulate(tablefmt="html", 
                                headers=[model_name + ' ' + traintest,"",'High','Low','p-value','Statistical Test']))

pt_characteristics_by_model(df_px2, model_name, 'Discovery')
| AML Epigenomic Risk Discovery |  | High | Low | p-value | Statistical Test |
|:---|:---|:---|:---|:---|:---|
| n |  | 453 | 493 |  |  |
| Age (years), mean (SD) |  | 8.4 (6.5) | 10.4 (6.0) | <0.001 | Two Sample T-test |
| Age group (years), n (%) | ≥10 | 193 (42.6) | 270 (54.8) | <0.001 | Chi-squared |
|  | <10 | 260 (57.4) | 223 (45.2) |  |  |
| Sex, n (%) | Female | 226 (49.9) | 242 (49.1) | 0.856 | Chi-squared |
|  | Male | 227 (50.1) | 251 (50.9) |  |  |
| Race or ethnic group, n (%) | White | 332 (78.7) | 365 (79.5) | 0.424 | Chi-squared (warning: expected count < 5) |
|  | Black or African American | 55 (13.0) | 47 (10.2) |  |  |
|  | Asian | 18 (4.3) | 25 (5.4) |  |  |
|  | American Indian or Alaska Native | 3 (0.7) | 2 (0.4) |  |  |
|  | Other | 10 (2.4) | 18 (3.9) |  |  |
|  | Pacific Islander | 4 (0.9) | 2 (0.4) |  |  |
| Hispanic or Latino ethnic group, n (%) | Hispanic or Latino | 87 (19.8) | 98 (20.5) | 0.848 | Chi-squared |
|  | Not Hispanic or Latino | 352 (80.2) | 379 (79.5) |  |  |
| MRD 1 Status, n (%) | Positive | 166 (41.5) | 94 (22.1) | <0.001 | Chi-squared |
|  | Negative | 234 (58.5) | 332 (77.9) |  |  |
| Leucocyte counts (10⁹/L), n (%) | ≥30 | 203 (44.8) | 264 (53.5) | 0.009 | Chi-squared |
|  | <30 | 250 (55.2) | 229 (46.5) |  |  |
| BM leukemic blasts (%), mean (SD) |  | 65.7 (26.1) | 62.0 (22.9) | 0.027 | Two Sample T-test |
| Risk Group, n (%) | High Risk | 87 (19.6) | 42 (8.6) | <0.001 | Chi-squared |
|  | Standard Risk | 330 (74.2) | 124 (25.5) |  |  |
|  | Low Risk | 28 (6.3) | 321 (65.9) |  |  |
| Clinical Trial, n (%) | AAML03P1 | 21 (4.6) | 15 (3.0) | 0.020 | Chi-squared |
|  | AAML0531 | 222 (49.0) | 285 (57.8) |  |  |
|  | AAML1031 | 210 (46.4) | 193 (39.1) |  |  |
| FLT3 ITD, n (%) | Yes | 87 (19.2) | 78 (15.9) | 0.198 | Chi-squared |
|  | No | 365 (80.8) | 414 (84.1) |  |  |
| Treatment Arm, n (%) | Arm A | 114 (46.9) | 144 (48.2) | 0.839 | Chi-squared |
|  | Arm B | 129 (53.1) | 155 (51.8) |  |  |
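
To make the "Chi-squared" column more concrete, the sketch below recomputes the p-value for the Sex row of the table above (226 and 242 females, 227 and 251 males in the High and Low groups). It assumes a Yates continuity-corrected test on a 2x2 table; the settings that `tableone` uses internally are not shown on this page, so treat this as an illustration that happens to agree with the reported value, not as the library's implementation. The function name is hypothetical.

```python
import math

def chi2_yates_2x2(a, b, c, d):
    """Yates-corrected chi-squared test for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            # Continuity correction: shrink each deviation by 0.5, not below 0.
            deviation = max(abs(observed[i][j] - expected) - 0.5, 0.0)
            statistic += deviation ** 2 / expected
    # With one degree of freedom the upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Rows: Female, Male. Columns: High, Low (counts taken from the Sex row).
statistic, p_value = chi2_yates_2x2(226, 242, 227, 251)
print(round(p_value, 3))
```

The result rounds to the 0.856 shown in the table, which suggests (but does not prove) that a continuity correction was applied.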

MethylScoreAML-37CpGs

pt_characteristics_by_model(df_px2, model_name='MethylScoreAML Categorical', traintest='Discovery')
| MethylScoreAML Categorical Discovery |  | High | Low | p-value | Statistical Test |
|:---|:---|:---|:---|:---|:---|
| n |  | 176 | 770 |  |  |
| Age (years), mean (SD) |  | 9.2 (6.5) | 9.5 (6.3) | 0.666 | Two Sample T-test |
| Age group (years), n (%) | ≥10 | 88 (50.0) | 375 (48.7) | 0.820 | Chi-squared |
|  | <10 | 88 (50.0) | 395 (51.3) |  |  |
| Sex, n (%) | Female | 86 (48.9) | 382 (49.6) | 0.924 | Chi-squared |
|  | Male | 90 (51.1) | 388 (50.4) |  |  |
| Race or ethnic group, n (%) | White | 131 (79.4) | 566 (79.1) | 0.138 | Chi-squared (warning: expected count < 5) |
|  | Black or African American | 26 (15.8) | 76 (10.6) |  |  |
|  | Asian | 5 (3.0) | 38 (5.3) |  |  |
|  | American Indian or Alaska Native | 1 (0.6) | 4 (0.6) |  |  |
|  | Other | 2 (1.2) | 26 (3.6) |  |  |
|  | Pacific Islander |  | 6 (0.8) |  |  |
| Hispanic or Latino ethnic group, n (%) | Hispanic or Latino | 34 (20.1) | 151 (20.2) | 1.000 | Chi-squared |
|  | Not Hispanic or Latino | 135 (79.9) | 596 (79.8) |  |  |
| MRD 1 Status, n (%) | Positive | 64 (43.5) | 196 (28.9) | 0.001 | Chi-squared |
|  | Negative | 83 (56.5) | 483 (71.1) |  |  |
| Leucocyte counts (10⁹/L), n (%) | ≥30 | 82 (46.6) | 385 (50.0) | 0.464 | Chi-squared |
|  | <30 | 94 (53.4) | 385 (50.0) |  |  |
| BM leukemic blasts (%), mean (SD) |  | 72.5 (21.8) | 61.8 (24.7) | <0.001 | Two Sample T-test |
| Risk Group, n (%) | High Risk | 31 (17.9) | 98 (12.9) | <0.001 | Chi-squared |
|  | Standard Risk | 132 (76.3) | 322 (42.4) |  |  |
|  | Low Risk | 10 (5.8) | 339 (44.7) |  |  |
| Clinical Trial, n (%) | AAML03P1 | 6 (3.4) | 30 (3.9) | 0.729 | Chi-squared |
|  | AAML0531 | 99 (56.2) | 408 (53.0) |  |  |
|  | AAML1031 | 71 (40.3) | 332 (43.1) |  |  |
| FLT3 ITD, n (%) | Yes | 27 (15.4) | 138 (17.9) | 0.496 | Chi-squared |
|  | No | 148 (84.6) | 631 (82.1) |  |  |
| Treatment Arm, n (%) | Arm A | 56 (53.3) | 202 (46.2) | 0.230 | Chi-squared |
|  | Arm B | 49 (46.7) | 235 (53.8) |  |  |

Validation#

AML Epigenomic Risk

pt_characteristics_by_model(df_test, model_name, 'validation')
| AML Epigenomic Risk validation |  | High | Low | p-value | Statistical Test |
|:---|:---|:---|:---|:---|:---|
| n |  | 80 | 120 |  |  |
| Age (years), mean (SD) |  | 7.4 (6.2) | 9.6 (5.7) | 0.013 | Two Sample T-test |
| Age group (years), n (%) | ≥10 | 31 (39.2) | 64 (53.8) | 0.063 | Chi-squared |
|  | <10 | 48 (60.8) | 55 (46.2) |  |  |
| Sex, n (%) | Female | 34 (42.5) | 52 (43.3) | 1.000 | Chi-squared |
|  | Male | 46 (57.5) | 68 (56.7) |  |  |
| Race or ethnic group, n (%) | White | 61 (78.2) | 81 (67.5) | 0.173 | Chi-squared (warning: expected count < 5) |
|  | Black or African American | 11 (14.1) | 21 (17.5) |  |  |
|  | Asian | 1 (1.3) |  |  |  |
|  | Other | 4 (5.1) | 17 (14.2) |  |  |
|  | Pacific Islander | 1 (1.3) | 1 (0.8) |  |  |
| Hispanic or Latino ethnic group, n (%) | Hispanic or Latino | 13 (16.5) | 12 (10.1) | 0.270 | Chi-squared |
|  | Not Hispanic or Latino | 66 (83.5) | 107 (89.9) |  |  |
| MRD 1 Status, n (%) | Positive | 34 (46.6) | 42 (36.5) | 0.224 | Chi-squared |
|  | Negative | 39 (53.4) | 73 (63.5) |  |  |
| Leucocyte counts (10⁹/L), n (%) | ≥30 | 31 (39.2) | 56 (46.7) | 0.375 | Chi-squared |
|  | <30 | 48 (60.8) | 64 (53.3) |  |  |
| BM leukemic blasts (%), mean (SD) |  | 65.3 (26.7) | 56.9 (24.5) | 0.037 | Two Sample T-test |
| Risk Group, n (%) | High Risk | 27 (33.8) | 24 (20.0) | <0.001 | Chi-squared |
|  | Standard Risk | 46 (57.5) | 40 (33.3) |  |  |
|  | Low Risk | 7 (8.8) | 56 (46.7) |  |  |
| Clinical Trial, n (%) | AML02 | 65 (81.2) | 93 (77.5) | 0.645 | Chi-squared |
|  | AML08 | 15 (18.8) | 27 (22.5) |  |  |
| FLT3 ITD, n (%) | Yes | 12 (15.2) | 19 (16.0) | 1.000 | Chi-squared |
|  | No | 67 (84.8) | 100 (84.0) |  |  |
| Treatment Arm, n (%) | Arm A | 43 (55.1) | 63 (52.5) | 0.829 | Chi-squared |
|  | Arm B | 35 (44.9) | 57 (47.5) |  |  |

MethylScoreAML-37CpGs

pt_characteristics_by_model(df_test, model_name='MethylScoreAML Categorical', traintest='Validation')
| MethylScoreAML Categorical Validation |  | High | Low | p-value | Statistical Test |
|:---|:---|:---|:---|:---|:---|
| n |  | 48 | 152 |  |  |
| Age (years), mean (SD) |  | 7.8 (6.4) | 9.1 (5.9) | 0.207 | Two Sample T-test |
| Age group (years), n (%) | ≥10 | 20 (42.6) | 75 (49.7) | 0.493 | Chi-squared |
|  | <10 | 27 (57.4) | 76 (50.3) |  |  |
| Sex, n (%) | Female | 25 (52.1) | 61 (40.1) | 0.197 | Chi-squared |
|  | Male | 23 (47.9) | 91 (59.9) |  |  |
| Race or ethnic group, n (%) | White | 35 (74.5) | 107 (70.9) | 0.170 | Chi-squared (warning: expected count < 5) |
|  | Black or African American | 8 (17.0) | 24 (15.9) |  |  |
|  | Asian | 1 (2.1) |  |  |  |
|  | Other | 2 (4.3) | 19 (12.6) |  |  |
|  | Pacific Islander | 1 (2.1) | 1 (0.7) |  |  |
| Hispanic or Latino ethnic group, n (%) | Hispanic or Latino | 10 (21.3) | 15 (9.9) | 0.073 | Chi-squared |
|  | Not Hispanic or Latino | 37 (78.7) | 136 (90.1) |  |  |
| MRD 1 Status, n (%) | Positive | 19 (41.3) | 57 (40.1) | 1.000 | Chi-squared |
|  | Negative | 27 (58.7) | 85 (59.9) |  |  |
| Leucocyte counts (10⁹/L), n (%) | ≥30 | 24 (51.1) | 63 (41.4) | 0.321 | Chi-squared |
|  | <30 | 23 (48.9) | 89 (58.6) |  |  |
| BM leukemic blasts (%), mean (SD) |  | 71.2 (23.8) | 57.0 (25.3) | 0.002 | Two Sample T-test |
| Risk Group, n (%) | High Risk | 10 (20.8) | 41 (27.0) | 0.001 | Chi-squared |
|  | Standard Risk | 31 (64.6) | 55 (36.2) |  |  |
|  | Low Risk | 7 (14.6) | 56 (36.8) |  |  |
| Clinical Trial, n (%) | AML02 | 34 (70.8) | 124 (81.6) | 0.164 | Chi-squared |
|  | AML08 | 14 (29.2) | 28 (18.4) |  |  |
| FLT3 ITD, n (%) | Yes | 5 (10.6) | 26 (17.2) | 0.393 | Chi-squared |
|  | No | 42 (89.4) | 125 (82.8) |  |  |
| Treatment Arm, n (%) | Arm A | 24 (51.1) | 82 (54.3) | 0.825 | Chi-squared |
|  | Arm B | 23 (48.9) | 69 (45.7) |  |  |

Kaplan-Meier Plots#

Overall study population#

AML Epigenomic Risk

for dataset, trial in zip([df_px2, df_test], 
                          ['Discovery', 'Validation']):
    draw_kaplan_meier(model_name=model_name,
                        df=dataset,
                        save_survival_table=False,
                        save_plot=False,
                        show_ci=False,
                        add_risk_counts=False,
                        trialname=trial,
                        figsize=(8,8))
../_images/c02be4ac08424b0abd36e0378c05f5c16a2e6f5558cdd176a78bfc334503a2be.png ../_images/9ba1a571efd186170ccca637410e58c8f0d1027b40d104f90b8eb219c02a20e5.png
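
`draw_kaplan_meier` is a project helper whose internals are not shown on this page. As a reminder of what a Kaplan-Meier curve represents, here is a minimal pure-Python sketch of the product-limit estimator on toy data; the function name and the numbers are illustrative assumptions, not the project's implementation.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct time with at least one event.

    durations: follow-up time per patient
    events: 1 if the event (e.g. death) was observed, 0 if censored
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for time in sorted(set(durations)):
        deaths = sum(1 for t, e in zip(durations, events) if t == time and e == 1)
        leaving = sum(1 for t in durations if t == time)
        if deaths > 0:
            # Product-limit step: multiply by the fraction surviving this time.
            survival *= 1.0 - deaths / at_risk
            curve.append((time, survival))
        # Everyone observed at this time (event or censored) leaves the risk set.
        at_risk -= leaving
    return curve

# Four toy patients; the second one is censored at time 2.
toy_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Censored patients lower the number at risk without producing a step, which is why the drop at time 3 is from 0.75 to 0.375 rather than to 0.5.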

MethylScoreAML-37CpGs

for dataset, trial in zip([df_px2, df_test], 
                          ['Discovery', 'Validation']):
    draw_kaplan_meier(model_name='MethylScoreAML Categorical',
                        df=dataset,
                        save_survival_table=False,
                        save_plot=False,
                        show_ci=False,
                        add_risk_counts=False,
                        trialname=trial,
                        figsize=(8,8))
../_images/392c7dbaab7e0846b06b5037e51baebd0b808eeba5e80e56a1b9f2ff297c6716.png ../_images/5b41be1f00154f9c8427d8603651811129f55ffce5096c78da77cc83412733f6.png

Per risk group#

AML Epigenomic Risk

for dataset, trial in zip([df_px2, df_test], ['Discovery', 'Validation']):

    risk_groups = ['High Risk', 'Low Risk', 'Standard Risk']
    for risk_group in risk_groups:
        draw_kaplan_meier(
            model_name=model_name,
            df=dataset[dataset['Risk Group'] == risk_group],
            save_plot=False,
            save_survival_table=False,
            add_risk_counts=False,
            trialname=f'{trial} {risk_group}',
            figsize=(8, 8))
../_images/02f73253b0c5d78dc9db096756172384d5467f084a1d6696230a84a8dd1b5279.png ../_images/0f2954b27d3efe117a8ddd97e1e25d71f737cbcd030cc64c9d37356c9bbe45ee.png ../_images/6085ca1107921ad47bd8b983da4c834133fdc2b5cbedd46657bb510e6990e2a6.png ../_images/53c2f957ab660330169dff43156b207a505f1c9aac1407cca1470d7a54b20e26.png ../_images/b38a6ebb642eb3e2fc611c7555dcb0f70a822aeeb9614efd315f607b512636f0.png ../_images/294143ae788485d71fc37e663be373d3e12d5443bc6a81343697d99aff186c05.png

MethylScoreAML-37CpGs

for dataset, trial in zip([df_px2, df_test], ['Discovery', 'Validation']):

    risk_groups = ['High Risk', 'Low Risk', 'Standard Risk']
    for risk_group in risk_groups:
        draw_kaplan_meier(
            model_name= 'MethylScoreAML Categorical',
            df=dataset[dataset['Risk Group'] == risk_group],
            save_plot=False,
            save_survival_table=False,
            add_risk_counts=False,
            trialname=f'{trial} {risk_group}',
            figsize=(8, 8))
../_images/4b8efe14096f92063efae456cac16ea6566fcdc35576cca07279d4c4418e6449.png ../_images/6224d0edb588781ac597dafb22a60b4dd35a96eaa50161a1cdb7e4d3f5e55d65.png ../_images/2cf2559dddaf1d3a45a1ec023641f9f28a3e0f78a378829cb8b483a181a68643.png ../_images/33d08c00d48ef9e7cf1d9bee26bbee678fbfd8fe1e691496b2c5c3274229340a.png ../_images/bef007792553034ce510a2db21b8bae05dc9601d024ecc146f2aeee7a782c2f4.png ../_images/f3d3c2c06d20e7fead6c2b93af8dc05357d0ebc26d03b78e5fb7888a387ef33b.png

Per risk group (AAML1831 COG)#

AML Epigenomic Risk

for dataset, trial in zip([df_px2],['Discovery']):

    risk_groups = ['High', 'Low', 'Standard']
    for risk_group in risk_groups:
        draw_kaplan_meier(
            model_name=model_name,
            df=dataset[dataset['Risk Group AAML1831'] == risk_group],
            save_plot=False,
            save_survival_table=False,
            add_risk_counts=False,
            trialname=f'{trial} {risk_group} Risk',
            figsize=(8, 8))
../_images/ed22e57f706d1fabc191654f979ab208ea51776a9ac987df78d279d15b544cdb.png ../_images/537a9e2db52d5eb70db4bb5675ed649b6d9c0eb31316f10ed5af43afc01205d3.png ../_images/1505179aa013d5b499866be03ed13c26134bf698a6058105d7de0910479e82c2.png

MethylScoreAML-37CpGs

for dataset, trial in zip([df_px2],['Discovery']):

    risk_groups = ['High', 'Low', 'Standard']
    for risk_group in risk_groups:
        draw_kaplan_meier(
            model_name='MethylScoreAML Categorical',
            df=dataset[dataset['Risk Group AAML1831'] == risk_group],
            save_plot=False,
            save_survival_table=False,
            add_risk_counts=False,
            trialname=f'{trial} {risk_group} Risk',
            figsize=(8, 8))
../_images/f91f0d3cfbe452064dc19765ac46e5021aaf6ab5057a9c6fdcbb18537c34c3bd.png ../_images/8429ff0b8d7c16ad556f9c5d9d9da3e4903aab51f0c251ff33308c5cf3701809.png ../_images/006d449d1beb2749c2f01e3f51aa1eaaef9e3e6298eddd0aa883d097237b047a.png

Forest Plots#

With MRD 1 and BM blast (%)#

AML Epigenomic Risk

for dataset, trial in zip([df_px2, df_test], ['Discovery', 'Validation']):
    
    df_ = dataset.copy()
    df_['BM leukemic blasts (%)'] = pd.cut(df_['BM leukemic blasts (%)'], bins=[0,50,100], labels=['≤50', '>50'])
    df_['AML_Epigenomic_Risk'] = df_['AML Epigenomic Risk']
    df_['MethylScoreAML_Categorical'] = df_['MethylScoreAML Categorical']
    df_['os_time_5y'] = df_['os.time at 5y']
    df_['os_evnt_5y'] = df_['os.evnt at 5y']
    df_['efs_time_5y'] = df_['efs.time at 5y']
    df_['efs_evnt_5y'] = df_['efs.evnt at 5y']

    draw_forest_plot_withBMblast(time='os_time_5y',
                        event='os_evnt_5y',
                        df=df_,
                        trialname=trial,
                        model_name='AML_Epigenomic_Risk',
                        save_plot=False)

    draw_forest_plot_withBMblast(time='efs_time_5y',
                        event='efs_evnt_5y',
                        df=df_,
                        trialname=trial,
                        model_name='AML_Epigenomic_Risk',
                        save_plot=False)
../_images/739098c372ed1499fd2b0329403ec36f07afcee71d3622648226b5317aadd789.png ../_images/52cc9acb4055b6f707836534a6bbe214d8b699391085c03ab3c3a4a42e3aecd3.png ../_images/02607f931674f65cfa2fac08a8a56605cce28a39def4fb68ee507b3603c49ffd.png ../_images/c03ec12d4bbee6e0fa9d2d94c8167c17ac589fd0abc04d3a33120295ff69431f.png

MethylScoreAML-37CpGs

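for dataset, trial in zip([df_px2, df_test], ['Discovery', 'Validation']):

    # Rebuild df_ for each cohort. Reusing the df_ left over from the previous
    # cell would plot the validation cohort in both iterations.
    df_ = dataset.copy()
    df_['BM leukemic blasts (%)'] = pd.cut(df_['BM leukemic blasts (%)'], bins=[0,50,100], labels=['≤50', '>50'])
    df_['MethylScoreAML_Categorical'] = df_['MethylScoreAML Categorical']
    df_['os_time_5y'] = df_['os.time at 5y']
    df_['os_evnt_5y'] = df_['os.evnt at 5y']
    df_['efs_time_5y'] = df_['efs.time at 5y']
    df_['efs_evnt_5y'] = df_['efs.evnt at 5y']
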
    draw_forest_plot_withBMblast(time='os_time_5y',
                        event='os_evnt_5y',
                        df=df_,
                        trialname=trial,
                        model_name='MethylScoreAML_Categorical',
                        save_plot=False)

    draw_forest_plot_withBMblast(time='efs_time_5y',
                        event='efs_evnt_5y',
                        df=df_,
                        trialname=trial,
                        model_name='MethylScoreAML_Categorical',
                        save_plot=False)
../_images/64c5bdf9fddb860c1dff5b79cc1daad71c759ac2a94e0e7e924d3e6e639ce345.png ../_images/eb367675325c4c690d734c388d842a5f136cd05eae54a76b3f2baa8f14d8c846.png ../_images/8968725b86aa3a869890a7d35e5a9e00ac62673f56c49aff43c03618d8f93cad.png ../_images/d543eb07dae632024008490e2405c335e813a4d22d6b540de0ac9658a55b63ca.png

ROC AUC performance#

Diagnostic Model#

def process_dataset_for_multiclass_auc(df):
    # One-hot encode the `WHO 2022 Diagnosis` labels
    df_dx_dummies = pd.get_dummies(df['WHO 2022 Diagnosis'])

    # transform boolean columns to integer
    df_dx_dummies = df_dx_dummies.astype(int)

    # join the one-hot encoded columns with a positional slice of the dataframe
    # (columns -33 to -5, presumably the per-class prediction columns)
    df_dx_auc = pd.concat([df.iloc[:, -33:-5], df_dx_dummies], axis=1)

    return df_dx_auc, df_dx_dummies

df_dx_auc_train, df_dx_dummies_train = process_dataset_for_multiclass_auc(df_dx)
df_dx_auc_cog, df_dx_dummies_cog = process_dataset_for_multiclass_auc(df_px2)
df_dx_auc_test, df_dx_dummies_test = process_dataset_for_multiclass_auc(df_test)

p1 = plot_multiclass_roc_auc(df_dx_auc_train, df_dx_dummies_train.columns, title='Discovery')
p2 = plot_multiclass_roc_auc(df_dx_auc_cog, df_dx_dummies_cog.columns, title='Discovery COG peds AML')
p3 = plot_multiclass_roc_auc(df_dx_auc_test, df_dx_dummies_test.columns, title='Validation')

# Create a gridplot
p = gridplot([
    [p1, p2, p3,],
    ], toolbar_location='above')

show(p)
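
`plot_multiclass_roc_auc` is a project helper that is not shown here. The one-hot step above suggests a one-vs-rest evaluation, where each class is scored against all others. The sketch below illustrates that idea in pure Python on toy data, computing AUC as the fraction of positive/negative pairs that are ranked correctly; the function names, class labels, and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def one_vs_rest_auc(true_labels, class_scores):
    """Compute one AUC per class, treating that class as positive and the rest as negative."""
    return {
        cls: roc_auc([1 if y == cls else 0 for y in true_labels], scores)
        for cls, scores in class_scores.items()
    }

# Toy example: four samples, three classes, one score column per class.
toy_labels = ["A", "A", "B", "C"]
toy_scores = {
    "A": [0.8, 0.4, 0.5, 0.1],
    "B": [0.1, 0.3, 0.4, 0.2],
    "C": [0.1, 0.3, 0.1, 0.7],
}
toy_auc = one_vs_rest_auc(toy_labels, toy_scores)
```

The pairwise loop is quadratic and only suitable for small illustrations; a library routine would normally be used on real cohorts.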

Prognostic models#

Discovery#

df_cat = df_px2[['os.evnt at 5y', 'MethylScoreAML Categorical', 'AML Epigenomic Risk']]
df_cont = df_px2[['os.evnt at 5y', 'MethylScoreAML', 'P(Death) at 5y']]

df_cont = df_cont.rename(columns={'P(Death) at 5y':'AML Epigenomic Risk (PaCMAP-LGBM)',
                                  'MethylScoreAML': 'MethylScoreAML (EWAS-CoxPH)'})

df_cat = df_cat.rename(columns={'AML Epigenomic Risk':'AML Epigenomic Risk (PaCMAP-LGBM)',
                                  'MethylScoreAML Categorical': 'MethylScoreAML (EWAS-CoxPH)'})

# .copy() so the in-place mapping below does not write to a slice of df_px2
risk = df_px2[['Risk Group AAML1831','Risk Group']].copy()

low_high_dict = {'Low': 0, 'Low Risk': 0,
                'Standard':0.5, 'Standard Risk': 0.5,
                'High': 1, 'High Risk': 1}

risk['Risk Group'] = risk['Risk Group'].map(low_high_dict)
risk['Risk Group AAML1831'] = risk['Risk Group AAML1831'].map(low_high_dict)

df_cat['AML Epigenomic Risk (PaCMAP-LGBM)'] = df_cat['AML Epigenomic Risk (PaCMAP-LGBM)'].map(low_high_dict)
df_cat['MethylScoreAML (EWAS-CoxPH)'] = df_cat['MethylScoreAML (EWAS-CoxPH)'].map(low_high_dict)

df_cont_risk = df_cont.join(risk)
df_cat_risk = df_cat.join(risk)

df_cont_risk = df_cont_risk.fillna(0.5)
df_cat_risk = df_cat_risk.fillna(0.5)

p1 = plot_roc_auc(df_cont_risk, 'os.evnt at 5y',title= 'Continuous (prob. of high risk)')
p2 = plot_roc_auc(df_cat_risk, 'os.evnt at 5y',title= 'Categorical (high-low risk)')

# Create a gridplot
p = gridplot([[p1, p2]], toolbar_location='above')

show(p)

Validation#

df_cat = df_test[['os.evnt at 5y', 'MethylScoreAML Categorical', 'AML Epigenomic Risk']]
df_cont = df_test[['os.evnt at 5y', 'MethylScoreAML', 'P(Death) at 5y']]

df_cont = df_cont.rename(columns={'P(Death) at 5y':'AML Epigenomic Risk (PaCMAP-LGBM)',
                                  'MethylScoreAML': 'MethylScoreAML (EWAS-CoxPH)'})

df_cat = df_cat.rename(columns={'AML Epigenomic Risk':'AML Epigenomic Risk (PaCMAP-LGBM)',
                                  'MethylScoreAML Categorical': 'MethylScoreAML (EWAS-CoxPH)'})

# .copy() so the in-place mapping below does not write to a slice of df_test
risk = df_test[['Risk Group']].copy()
risk['Risk Group'] = risk['Risk Group'].map(low_high_dict)

df_cat['AML Epigenomic Risk (PaCMAP-LGBM)'] = df_cat['AML Epigenomic Risk (PaCMAP-LGBM)'].map(low_high_dict)
df_cat['MethylScoreAML (EWAS-CoxPH)'] = df_cat['MethylScoreAML (EWAS-CoxPH)'].map(low_high_dict)

df_cont_risk_test = df_cont.join(risk)
df_cat_risk_test = df_cat.join(risk)

# Rename `Risk Group` to `Risk Group AML02-08`
df_cont_risk_test = df_cont_risk_test.rename(columns={'Risk Group':'Risk Group AML02-08'})
df_cat_risk_test = df_cat_risk_test.rename(columns={'Risk Group':'Risk Group AML02-08'})

p1 = plot_roc_auc(df_cont_risk_test, 'os.evnt at 5y',title= 'Continuous (prob. of high risk)')
p2 = plot_roc_auc(df_cat_risk_test, 'os.evnt at 5y',title= 'Categorical (high-low risk)')

# Create a gridplot
p = gridplot([[p1, p2]], toolbar_location='above')

show(p)

Pearson Correlation#

Discovery#

draw_scatter_pearson(df=df_cont_risk,x='MethylScoreAML (EWAS-CoxPH)', y='AML Epigenomic Risk (PaCMAP-LGBM)',s=20)

df_cont_risk.iloc[:,1:].corr().round(2)
../_images/51db62a866aa55d18c7478c2b946ff51f86312d7ceb5688c649126a9381ba75e.png
|  | MethylScoreAML (EWAS-CoxPH) | AML Epigenomic Risk (PaCMAP-LGBM) | Risk Group AAML1831 | Risk Group |
|:---|---:|---:|---:|---:|
| MethylScoreAML (EWAS-CoxPH) | 1.00 | 0.76 | 0.48 | 0.53 |
| AML Epigenomic Risk (PaCMAP-LGBM) | 0.76 | 1.00 | 0.54 | 0.59 |
| Risk Group AAML1831 | 0.48 | 0.54 | 1.00 | 0.62 |
| Risk Group | 0.53 | 0.59 | 0.62 | 1.00 |
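
The matrix above comes from `DataFrame.corr()`, whose default method is Pearson. Note that the two risk-group columns were mapped to 0, 0.5, and 1 earlier, so their coefficients treat the categories as evenly spaced numbers. As a reference for what each cell measures, here is a minimal pure-Python sketch of the Pearson coefficient; the function name and toy values are illustrative only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # The 1/n factors cancel between numerator and denominator.
    return covariance / math.sqrt(var_x * var_y)

r_perfect = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_partial = pearson_r([1, 2, 3, 4], [1, 3, 2, 4])
```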

Validation#

draw_scatter_pearson(df=df_cont_risk_test,x='MethylScoreAML (EWAS-CoxPH)', y='AML Epigenomic Risk (PaCMAP-LGBM)',s=20)

df_cont_risk_test.iloc[:,1:].corr().round(2)
../_images/b6a591875aa5835ef6223892c2c6e97cbd9726be2170467c53c0018a438c2ca6.png
|  | MethylScoreAML (EWAS-CoxPH) | AML Epigenomic Risk (PaCMAP-LGBM) | Risk Group AML02-08 |
|:---|---:|---:|---:|
| MethylScoreAML (EWAS-CoxPH) | 1.00 | 0.69 | 0.44 |
| AML Epigenomic Risk (PaCMAP-LGBM) | 0.69 | 1.00 | 0.48 |
| Risk Group AML02-08 | 0.44 | 0.48 | 1.00 |

Sankey plots#

Note

The Sankey plots below compare how patients are distributed across two sets of categories. The width of each line is proportional to the number of patients in that group.
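
The line widths described in the note amount to a count of patients per pair of categories. A minimal sketch of that tally follows; `draw_sankey_plot` itself is a project helper not shown here, and skipping pairs with a missing label is only an assumed analogue of its `nan_action='drop'` option. The function name and labels are hypothetical.

```python
from collections import Counter

def sankey_flows(left_labels, right_labels):
    """Count patients per (left category, right category) pair, skipping missing labels."""
    flows = Counter()
    for left, right in zip(left_labels, right_labels):
        if left is None or right is None:
            continue  # assumed analogue of dropping unlabeled samples
        flows[(left, right)] += 1
    return flows

# Toy labels for five patients; the last one has no label on the left side.
toy_left = ["High", "High", "Low", "Low", None]
toy_right = ["High", "Low", "Low", "Low", "High"]
toy_flows = sankey_flows(toy_left, toy_right)
```

Each key in `toy_flows` corresponds to one band in the plot, and its count sets the band width.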

Samples with annotated diagnosis info#

colors = get_custom_color_palette()


draw_sankey_plot(df_train, 'WHO 2022 Diagnosis', 'AL Epigenomic Subtype', colors,
                 title='Discovery cohort', fig_size=(4, 11),
                 fontsize=8, nan_action='drop')

draw_sankey_plot(df_px2, 'WHO 2022 Diagnosis', 'AL Epigenomic Subtype', colors,
                 title= 'Discovery cohort (COG peds AML Dx samples only)',fig_size=(4, 10),
                 fontsize=8, nan_action='drop')

draw_sankey_plot(df_test, 'WHO 2022 Diagnosis', 'AL Epigenomic Subtype', colors,
                 title= 'Validation cohort',fig_size=(3, 7),
                 fontsize=8, nan_action='drop')
../_images/463eab1ee35ed8cb8945783ebe19d343a2ce6aae81c0ca6adb2fba59d1a4eedd.png ../_images/d8d0ef5f1281789ecde55b10c9f9b7e001254a41feb19f691d0747e8c88736ac.png ../_images/9194bec4b28dfe76d45bd215fa32643ac208d68d6e935126101e4b809059f04c.png

Predictions in samples with no WHO 2022 diagnosis available#

draw_sankey_plot(df_train, 'WHO 2022 Diagnosis', 'AL Epigenomic Subtype', colors,
                 title='Discovery cohort', fig_size=(4, 9),
                 fontsize=8, nan_action='keep only')

draw_sankey_plot(df_px2, 'WHO 2022 Diagnosis', 'AL Epigenomic Subtype', colors,
                 title= 'Discovery cohort (COG peds AML Dx samples only)',fig_size=(4, 8),
                 fontsize=8, nan_action='keep only')

draw_sankey_plot(df_test, 'WHO 2022 Diagnosis', 'AL Epigenomic Subtype', colors,
                 title= 'Validation cohort',fig_size=(4, 8),
                 fontsize=8, nan_action='keep only')
../_images/dbcb338bb9e94c0c0e1c83dd51673aef386c3bbe6a974d96ace76f73236a260b.png ../_images/6349969df19167a2ea25fb6090e54c502e63d856a95f56eeaa3ad759998e6ffc.png ../_images/e220f50d10ce925d52587f735aaa29efddf024b2c4cee49e7bab80de49808636.png

Reason for unclassified samples#

draw_sankey_plot(df_train, 'WHO 2022 Diagnosis', 'Primary Cytogenetic Code', colors,
                 title='Discovery cohort', fig_size=(4, 6),
                 fontsize=8, nan_action='keep only')

draw_sankey_plot(df_px2, 'WHO 2022 Diagnosis', 'Gene Fusion', colors,
                 title= 'Discovery cohort (COG peds AML Dx samples only)',fig_size=(4, 9),
                 fontsize=8, nan_action='keep only')

draw_sankey_plot(df_test, 'WHO 2022 Diagnosis', 'Primary Cytogenetic Code', colors,
                 title= 'Validation cohort',fig_size=(2, 3),
                 fontsize=8, nan_action='keep only')
../_images/c1d8d78e6aaeb01a0f87a4353953988ecde2a66c3bb639fd5db312afc21db4a8.png ../_images/10c3ac9863bf5e4d949770a83545daa07982b475104cb9147cd92226e4a87446.png ../_images/91088196902648ce400921f432bb54660218c20bf4a0e51d6c5955f8f73a44b9.png

Risk group comparison in COG#

draw_sankey_plot(df_px2, 'Risk Group', 'Risk Group AAML1831', colors,
                 title= 'Discovery cohort (COG peds AML Dx samples only)',fig_size=(2, 4),
                 fontsize=8, nan_action='drop')

draw_sankey_plot(df_px2, 'Risk Group AAML1831', 'AML Epigenomic Risk', colors,
                 title= 'Discovery cohort (COG peds AML Dx samples only)',fig_size=(2, 4),
                 fontsize=8, nan_action='drop')
../_images/f83982441470711056fc53790dbdbb61af52b56f33d6c14a949bf7d6d755015a.png ../_images/f0cbc2deec40460686bd569c3f8ddd0cb990890c4669f4cad0925973c417a653.png

Px and Dx model comparison#

draw_sankey_plot(df_train, 'AML Epigenomic Risk', 'AL Epigenomic Subtype', colors,
                 title='Discovery cohort', fig_size=(3, 10),
                 fontsize=8, nan_action='drop')

draw_sankey_plot(df_px2, 'AML Epigenomic Risk', 'AL Epigenomic Subtype', colors,
                 title= 'Discovery cohort (COG peds AML Dx samples only)',fig_size=(3, 10),
                 fontsize=8, nan_action='drop')

draw_sankey_plot(df_test, 'AML Epigenomic Risk', 'AL Epigenomic Subtype', colors,
                 title= 'Validation cohort',fig_size=(3, 8),
                 fontsize=8, nan_action='drop')
../_images/3ac2c581c8596e87a17a3c03d48190b6443c12cb2dbc17d90d95de27141fcb5f.png ../_images/1d89e7491e83f5655f48a839f96dc6b3daabfc10ff6b78faf06e2aea412458bd.png ../_images/a6050201652be50362d3de0feb0080cdbfd7d7cd1acf919dd7db04a1bd051944.png

Performance metrics#

AML Epigenomic Risk#

plot_confusion_matrix_stacked(df_px2, df_test, 'os.evnt at 5y', 'AML Epigenomic Risk_int','os.evnt at 5y')
../_images/458ca6d8140a4b09d6662fc6fc0c470e733efa8033c1c976c2a0c778e95acf8b.png
Metrics:
|            |   Accuracy |   Sensitivity |   Specificity |   Precision |   F1-score |   AUC-ROC |
|:-----------|-----------:|--------------:|--------------:|------------:|-----------:|----------:|
| Train      |      0.705 |         0.757 |         0.676 |       0.565 |      0.647 |     0.717 |
| Validation |      0.7   |         0.667 |         0.714 |       0.5   |      0.571 |     0.69  |
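
For reference, the columns in the table above can all be derived from the four cells of a binary confusion matrix. The sketch below shows the standard definitions in pure Python; `plot_confusion_matrix_stacked` is a project helper not shown here, so the function name and toy labels are assumptions. For hard 0/1 predictions the ROC curve has a single operating point, so its area equals the mean of sensitivity and specificity, which is consistent with the Train row: (0.757 + 0.676) / 2 ≈ 0.717.

```python
def binary_metrics(y_true, y_pred):
    """Threshold metrics for 0/1 labels.

    Assumes both classes occur in y_true and at least one true positive exists,
    so none of the denominators is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "Accuracy": (tp + tn) / len(pairs),
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "Precision": precision,
        "F1-score": 2 * precision * sensitivity / (precision + sensitivity),
        # Single operating point: area under the ROC curve is the balanced accuracy.
        "AUC-ROC": (sensitivity + specificity) / 2,
    }

# Toy labels: 3 events and 5 non-events, with one miss and one false alarm.
toy_metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
```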

MethylScoreAML#

plot_confusion_matrix_stacked(df_px2, df_test, 'os.evnt at 5y', 'MethylScoreAML_cat_bin','os.evnt at 5y')
../_images/ccafd5853b2930a9e97bdefafd7be391f3b6896a2109af9d7decb95db84c14e0.png
Metrics:
|            |   Accuracy |   Sensitivity |   Specificity |   Precision |   F1-score |   AUC-ROC |
|:-----------|-----------:|--------------:|--------------:|------------:|-----------:|----------:|
| Train      |       0.74 |         0.396 |         0.931 |       0.761 |      0.521 |     0.664 |
| Validation |       0.71 |         0.417 |         0.836 |       0.521 |      0.463 |     0.626 |

AL Epigenomic Subtype#

plot_confusion_matrix_stacked(df_dx, df_test, 'WHO 2022 Diagnosis', 'AL Epigenomic Subtype', 'WHO 2022 Diagnosis', figsize=(22,14))
../_images/0df59acb14f34115f127fd24c8a5dd2d955546925badc6b94cbce5ed641678fc.png
Metrics:
|            |   Accuracy |   Macro F1 |   Weighted F1 |   Cohen's Kappa |
|:-----------|-----------:|-----------:|--------------:|----------------:|
| Train      |      0.989 |      0.986 |         0.989 |           0.988 |
| Validation |      0.96  |      0.66  |         0.98  |           0.941 |
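
Macro F1 averages the per-class F1 scores equally, while weighted F1 weights each class by its number of true samples. A gap like the one in the Validation row (0.66 versus 0.98) can arise when a few rare classes score poorly, although the page does not say which classes are responsible. The pure-Python sketch below illustrates the two averages on toy labels; it is a simplification that only averages over classes present in `y_true`, and the function name is hypothetical.

```python
def macro_and_weighted_f1(y_true, y_pred):
    """Return (macro F1, weighted F1) over the classes present in y_true."""
    classes = sorted(set(y_true))
    f1_by_class = {}
    support = {}
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denominator = 2 * tp + fp + fn
        f1_by_class[cls] = 2 * tp / denominator if denominator else 0.0
        support[cls] = tp + fn  # number of true samples in this class
    macro = sum(f1_by_class.values()) / len(classes)
    weighted = sum(f1_by_class[c] * support[c] for c in classes) / len(y_true)
    return macro, weighted

# Toy case: nine samples of a common class and one missed sample of a rare class.
toy_true = ["A"] * 9 + ["B"]
toy_pred = ["A"] * 10
toy_macro, toy_weighted = macro_and_weighted_f1(toy_true, toy_pred)
```

In the toy case a single missed rare-class sample halves the macro score while the weighted score stays high.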

Watermark#

Author: Francisco_Marchi@Lamba_Lab_UF

Last updated: 2024-09-08

Python implementation: CPython
Python version       : 3.10.13
IPython version      : 8.20.0

pandas    : 2.2.0
seaborn   : 0.13.2
matplotlib: 3.8.2
tableone  : 0.8.0
sklearn   : 1.4.0
lifelines : 0.28.0

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 5.15.133.1-microsoft-standard-WSL2
Machine     : x86_64
Processor   : x86_64
CPU cores   : 32
Architecture: 64bit

Git repo: git@github.com:f-marchi/ALMA.git